
Agenda

  • Simple Linear regression
  • Multiple Linear regression
  • Qualitative predictors (dummy variables)
  • Extensions: interactions, nonlinear effects
  • Potential issues: outliers and assumption verification (residual plots and transformations)

Introduction

Linear regression is a simple approach to the generic supervised learning problem \[Y = f(X_1,X_2,\dots,X_p) + \varepsilon.\] It assumes that the dependence of \(Y\) on \(X_1,X_2,\dots,X_p\) is linear; it is therefore a parametric model.

We can write down \(f(\mathbf{X})\) in the form of an equation: \[y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon = \mathbf{x}^\intercal \cdot \boldsymbol{\beta} + \varepsilon,\] identically and independently for every data point, where \(\mathbf{x} = (1,x_1,x_2,\dots,x_p)\) and \(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \dots, \beta_p)\).

Introduction

A notational convenience: the intercept gets absorbed into the vector of predictors by including a leading term of \(1\).

Design matrix \(\mathbf{X}\):

\[\mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{array} \right], \quad \boldsymbol{\beta} = \left[ \begin{array}{c} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{array} \right], \quad \mathbf{y} = \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right], \quad \boldsymbol{\varepsilon} = \left[ \begin{array}{c} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{array} \right]\]

\[y_{i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_{i} \quad \forall i = 1, \dots, n\] \[\mathbf{y} = \mathbf{X} \cdot \boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
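A minimal R sketch of this matrix form on simulated data (the sample size, coefficients, and variable names below are chosen purely for illustration): model.matrix() builds the design matrix \(\mathbf{X}\) with the leading column of \(1\)'s, and the response is generated as \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\).

set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
beta <- c(1, 2, -0.5)               # (beta_0, beta_1, beta_2)
eps  <- rnorm(n, mean = 0, sd = 1)  # iid N(0, sigma^2) error terms
X <- model.matrix(~ x1 + x2)        # n x (p + 1) design matrix with leading 1's
y <- as.vector(X %*% beta + eps)    # y = X beta + eps
head(X)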

Introduction

  • Simple linear regression: \(Y\) is univariate, \(p = 1\): covered in any Intro to Stat course
  • Multiple linear regression: \(Y\) is univariate, \(p > 1\): we discuss this in this course
  • Multivariate multiple linear regression: \(Y\) is a vector, \(p > 1\): more advanced topics

Assumptions:

  1. linearity of the underlying function
  2. error terms are uncorrelated
  3. error terms have constant variance \(\sigma^2\)
  4. error terms are normally distributed

Combining (2), (3), and (4):

\[\varepsilon_1, \dots, \varepsilon_n \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\]

Linear models: major pros

  • They are a simple foundation for more sophisticated techniques (that are often extensions of the linear regression)
  • They are efficient to estimate, even for extremely large data sets
  • They have lower estimation variance than most nonlinear/nonparametric alternatives
  • They are easier to interpret than nonparametric models
  • Techniques for feature selection (choosing which \(x_j\)’s matter) are very well developed for linear models

Linear models: major cons

  • Linear models are pretty much always wrong; if the truth is nonlinear, then the linear model will provide a biased estimate
  • Linear models depend on which transformation of the data you use, e.g. \(x\) versus \(\log(x)\)
  • Linear models do not estimate interactions among the predictors unless you explicitly build them in

Univariate linear regression

Example

Notation

We assume the model \(Y = \beta_0 + \beta_1 x + \varepsilon\), where \(\beta_0\) and \(\beta_1\) are two unknown parameters representing the intercept and the slope, and \(\varepsilon\) is the error term.

Given some estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the model coefficients, we predict future outcomes using \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\] where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of \(X = x\).

How to find the solution

A reasonable way to fit a line is to minimize the amount by which the fitted value differs from the actual value. This amount is called the residual. Then \(e_i = y_i - \hat{y}_i\) represents the \(i^{th}\) residual.

Ideally we want to minimize the size of all residuals:

  • If they were all zero we would have a perfect line
  • Trade-off between moving closer to some points and at the same time moving away from other points
  • Minimize the “total” of residuals to get best fit

How to find the solution

We define the residual sum of squares (RSS) as \[\begin{aligned} RSS &= e_1^2 + e_2^2 + \dots + e_n^2 \\ &= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2 \end{aligned}\]

The minimizing values for the RSS are

\[\hat{\beta}_1 = r_{xy} \frac{s_y}{s_x}, \quad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\]

Sample mean and Sample variance

  • Sample mean: measure of centrality \[\bar{y} = \dfrac{1}{n} \sum_{i=1}^n y_i\]

  • Sample variance: measure of spread \[s_y^2 = \dfrac{1}{n-1} \sum_{i=1}^n (y_i - \overline{y})^2\]

  • Sample standard deviation \[s_y = \sqrt{s_y^2}\]

Correlation

\[\text{Cov}(Y, X) = \text{Cov}(X, Y) = \dfrac{\sum_{i=1}^n (y_i - \overline{y})(x_i - \overline{x})}{n - 1}\]

The correlation \(r_{xy}\) is the standardized covariance: \[r_{xy} = \text{corr}(Y, X) = \dfrac{\text{Cov}(Y, X)}{\sqrt{s_x^2 s_y^2}} = \dfrac{\text{Cov}(Y, X)}{s_x s_y}\]

The correlation is scale invariant and the units of measurement do not matter: it is always true that \(-1 \leq \text{corr}(Y, X) \leq 1\).

This gives the direction (- or +) and strength (0 \(\rightarrow\) 1) of the linear relationship between \(X\) and \(Y\).
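A quick numerical check of these properties on simulated data (the variables below are invented for illustration): rescaling \(X\) or \(Y\) by positive constants leaves the correlation unchanged, and it always falls in \([-1, 1]\).

set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
cov(x, y)               # sample covariance (depends on units)
cor(x, y)               # standardized: always between -1 and 1
cor(1000 * x, y / 10)   # unchanged: correlation is scale invariant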

Correlation

Only measures linear relationships: \(\text{corr}(X, Y) = 0\) does not mean the variables are not related!

Also be careful with influential observations.

Correlation

  • Be careful: correlation and dependence are two very different probabilistic concepts
  • The two are equivalent only for (jointly) normal random variables
  • In general, independence implies zero correlation, but not vice versa

https://www.tylervigen.com/spurious-correlations

For the same reason, the linear regression cannot be interpreted in a causal way.

Back to the least squares solution

  1. Intercept: \[\hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X} \Rightarrow \overline{Y} = \hat{\beta}_0 + \hat{\beta}_1 \overline{X}\]
    • The point \((\overline{X}, \overline{Y})\) is on the regression line!
    • Least squares finds the point of means and rotates the line through that point until getting the “right” slope
  2. Slope: \[\hat{\beta}_1 = \text{corr}(X,Y) \dfrac{s_y}{s_x}\]
    • So, the right slope is the correlation coefficient times a scaling factor that ensures the proper units for \(\hat{\beta}_1\)
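A short R sketch (on simulated data, with coefficients chosen only for illustration) computing the closed-form least squares solution from sample moments and checking it against lm():

set.seed(1)
x <- rnorm(50)
y <- 40 + 45 * x + rnorm(50, sd = 10)
b1 <- cor(x, y) * sd(y) / sd(x)   # slope: correlation times a scaling factor
b0 <- mean(y) - b1 * mean(x)      # intercept: the line passes through (xbar, ybar)
c(b0, b1)
coef(lm(y ~ x))                   # same estimates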

Assessing model accuracy

Remember that \(Y = \widehat{Y} + e\), where \(e\) denotes the residual. Since \(\widehat{Y}\) and \(e\) are uncorrelated, i.e. \(\text{corr}(\widehat{Y}, e) = 0\),

\[ \begin{aligned} &\text{Var}(Y) = \text{Var}(\widehat{Y} + e) = \text{Var}(\widehat{Y}) + \text{Var}(e) \\ &\sum_{i=1}^n (y_i - \overline{y})^2 = \sum_{i=1}^n (\hat{y}_i - \overline{\hat{y}})^2 + \sum_{i=1}^n (e_i - \overline{e})^2 \end{aligned} \]

Given that \(\overline{e} = 0\) and \(\overline{\hat{y}} = \overline{y}\) (why?), we get \[\underbrace{\sum_{i=1}^n (y_i - \overline{y})^2}_{\substack{\text{Total Sum of} \\ \text{Squares (TSS)}}} = \underbrace{\sum_{i=1}^n (\hat{y}_i - \overline{y})^2}_{\substack{\text{Model Sum of} \\ \text{Squares (MSS)}}} + \underbrace{\sum_{i=1}^n e_i^2}_{\substack{\text{Residual Sum of} \\ \text{Squares (RSS)}}}\]
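The decomposition is easy to verify numerically; a small sketch on simulated data (invented for illustration):

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
TSS <- sum((y - mean(y))^2)
MSS <- sum((fitted(fit) - mean(y))^2)   # mean of the fitted values equals mean(y)
RSS <- sum(resid(fit)^2)
all.equal(TSS, MSS + RSS)               # TRUE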

Assessing model accuracy

The coefficient of determination, denoted by \(R^2\), measures goodness of fit:

\[R^2 = \dfrac{MSS}{TSS} = \dfrac{TSS-RSS}{TSS} = 1 - \dfrac{RSS}{TSS}\]

  • \(0 \leq R^2 \leq 1\)
  • The closer \(R^2\) is to \(1\), the “better” the fit
  • \(R^2\) is large when a big part of the total sum of squares (TSS) is explained by the model sum of squares (MSS), relatively compared to the residual sum of squares (RSS)
  • \(R^2\) is small when the linear model assumptions are not met, or the inherent error \(\sigma^{2}\) is large
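As a check (same kind of simulated data as above, chosen only for illustration), \(R^2\) computed from the sums of squares matches the value reported by summary():

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)
c(1 - RSS / TSS, summary(fit)$r.squared)   # identical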

Assessing the estimate accuracies

The standard error of an estimator reflects how it varies under repeated sampling. We have (large \(n\) approximation): \[ \begin{aligned} &\text{SE}(\hat{\beta}_1)^2 = \sigma^2 \dfrac{1}{\sum_{i=1}^n (x_i - \overline{x})^2} \approx s^2 \dfrac{1}{\sum_{i=1}^n (x_i - \overline{x})^2} \\ &\text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \dfrac{1}{n} + \dfrac{\overline{x}^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \right] \approx s^2 \left[ \dfrac{1}{n} + \dfrac{\overline{x}^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \right] \end{aligned} \] where \(\sigma^2 = \text{Var}(\varepsilon)\) is estimated via \(s^2\) (more on this to come).

These standard errors can be used to compute confidence intervals (CI). An approximate \(95\%\) confidence interval for \(\beta_1\) is \[\text{CI} = \hat{\beta}_1 \pm 2\, \text{SE}(\hat{\beta}_1)\]
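In R, confint() gives the exact \(t\)-based intervals; a sketch on simulated data (variables invented for illustration) comparing them with the rough \(\pm 2\,\text{SE}\) rule:

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
confint(fit, level = 0.95)                    # exact t-based intervals
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
est + c(-2, 2) * se                           # rough +/- 2 SE interval for beta_1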

Hypothesis testing

Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing:

  • \(H_0\): There is no relationship between \(X\) and \(Y\): \(\beta_1 = 0\)
  • \(H_a\): There is some relationship between \(X\) and \(Y\): \(\beta_1 \neq 0\)

If \(\beta_1 = 0\), then the model reduces to \(Y = \beta_0 + \varepsilon\), and \(X\) is not associated with \(Y\).

Hypothesis testing

To test the null hypothesis, we compute a \(t\)-statistic, given by \[t = \dfrac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)}\]

This will have a \(t\)-distribution with \(n-2\) degrees of freedom, assuming \(\beta_1 = 0\).

Using statistical software, it is easy to compute the probability of observing any value equal to \(|t|\) or larger. We call this probability the p-value.

This is equivalent to the confidence interval: a value of \(|t|\) greater than about \(1.96\) implies that the hypothesized value lies outside of the \(95\%\) confidence interval… therefore reject!
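A sketch (simulated data, invented for illustration) recomputing the \(t\)-statistic and p-value for \(H_0: \beta_1 = 0\) by hand and comparing them with the row reported by summary():

set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
tab <- coef(summary(fit))
t_stat <- tab["x", "Estimate"] / tab["x", "Std. Error"]
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
c(t_stat, p_val)
tab["x", c("t value", "Pr(>|t|)")]   # same values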

Why CI are better than tests

Suppose in testing \(H_0: \beta_1 = 1\) you got a t-stat of \(6\) and the confidence interval was \([1.00001, 1.00002]\).

  1. Do you reject \(H_0: \beta_1 = 1\)?
  2. Could you justify that to your boss? Probably not! (why?)

Now, suppose in testing \(H_0: \beta_1 = 1\) you got a t-stat of \(-0.02\) and the confidence interval was \([-100, 100]\).

  1. Do you accept \(H_0: \beta_1 = 1\)?
  2. Could you justify that to your boss? Probably not! (why?)

Prediction intervals

There are two things that we want to know:

  • What value of \(Y\) can we expect for a given \(X\)?
  • How sure are we about this forecast? Or how different could \(Y\) be from what we expect?

Goal: measure the accuracy of our forecasts, or how much uncertainty there is in the forecast. One method is to specify a range of \(Y\) values that are likely, given an \(X\) value.

Prediction Interval: probable range for \(Y\)-values given \(X\).

Key Insight: to construct a prediction interval, we will have to assess the likely range of error values corresponding to a \(Y\) value that has not yet been observed!

Prediction intervals

You are told (without looking at the data) that \(\beta_0 = 40\), \(\beta_1 = 45\), \(\sigma = 10\), and you are asked to predict the price (in $1,000s) of a \(1500\) square foot house, i.e. \(x = 1.5\) with size measured in thousands of square feet.

What do you know about \(Y\) from the model?

\[Y = 40 + 45 (1.5) + \varepsilon = 107.5 + \varepsilon\]

Thus our prediction for price is \(Y \mid (X = 1.5) \sim \mathcal{N}(107.5, 10^2)\) and a \(95\%\) Prediction Interval for \(Y\) is \(87.5 < Y < 127.5\).

Prediction intervals

The conditional distribution for Y given X is Normal: \[Y \mid (X = x) \sim \mathcal{N}(\beta_0 + \beta_1 x ,\sigma^2)\]
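In practice the coefficients are estimated, and predict() with interval = "prediction" produces the corresponding interval (slightly wider than the known-parameter calculation above). A sketch on simulated house-price data; the variable names size and price and the data themselves are invented for illustration:

set.seed(1)
size  <- runif(100, 0.8, 3.5)                   # size in 1000s of square feet
price <- 40 + 45 * size + rnorm(100, sd = 10)   # true model from the example
fit <- lm(price ~ size)
predict(fit, newdata = data.frame(size = 1.5),
        interval = "prediction", level = 0.95)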

Estimation of Error variance

We estimate \(\sigma^2\) with \[s^2 = \dfrac{1}{n-2} \sum_{i=1}^n e_i^2 = \dfrac{RSS}{n-2}\] (we divide by \(n-2\) because the model has \(2\) estimated regression coefficients, \(\beta_0\) and \(\beta_1\)).

We usually report \(s = \sqrt{RSS/(n-2)}\), which is in the same units as \(Y\). It is also called the regression standard error (R labels it the “Residual standard error”).
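A quick check on simulated data (invented for illustration): the hand calculation of \(s\) matches what sigma() extracts from a fitted lm object.

set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 10)
fit <- lm(y ~ x)
c(sqrt(sum(resid(fit)^2) / (n - 2)), sigma(fit))   # identical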

Example

\(R^2\) as measure of fit

  • \(R^2 = 82\%\)
  • Great \(R^2\): we are happy using this model to predict house prices, right?

Example

\(s^2\) as measure of precision

  • But, \(s = 14\), leading to a predictive interval width of about US $60,000! How do you feel about the model now?
  • As a practical matter, \(s\) is a much more relevant quantity than \(R^2\).

Regression in R

fit <- lm(y ~ x)
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.830 -11.178   2.211   9.141  20.217 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.728      8.938   4.333 0.000401 ***
## x             44.069      3.724  11.833 6.32e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.72 on 18 degrees of freedom
## Multiple R-squared:  0.8861, Adjusted R-squared:  0.8798 
## F-statistic:   140 on 1 and 18 DF,  p-value: 6.324e-10

Multiple linear regression

Sales example

Questions we might ask:

  • Is there a relationship between advertising budget and sales?
  • How strong is the relationship between advertising budget and sales? Which media contribute to sales?
  • How large is the effect of each medium on sales?
  • How accurately can we predict future sales?
  • Is the relationship linear?
  • Is there synergy among the advertising media?

The model

Here our model is \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon\] We interpret \(\beta_j\) as the average effect on \(Y\) of a one unit increase in \(X_j\) , holding all other predictors fixed. But predictors usually change together!

In the advertising example, the model becomes \[\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \varepsilon\]
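A hedged R sketch of this fit, assuming the Advertising data are available locally as "Advertising.csv" with columns sales, TV, radio, and newspaper (the file name and location are assumptions, not part of the course material):

Advertising <- read.csv("Advertising.csv")   # hypothetical local file path
fit <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(fit)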

The model

The ideal scenario is when the predictors are uncorrelated - a balanced design:

  • Each coefficient can be estimated and tested separately
  • Interpretations such as “a unit change in \(X_j\) is associated with a \(\beta_j\) change in \(Y\), while all the other variables stay fixed”, are possible

Correlations amongst predictors cause problems:

  • The variance of all coefficients tends to increase, sometimes dramatically
  • Interpretations become hazardous - when \(X_j\) changes, everything else changes

Notation

  • Fitted Values: \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_p x_{ip}\]
  • Residuals: \[e_i = y_i - \hat{y}_i\]
  • Least Squares: find \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_p\) to minimize \[RSS = \sum_{i=1}^n e_i^2\]
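A minimal sketch of these quantities on simulated data with three predictors (the variables and coefficients are invented for illustration):

set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)
head(fitted(fit))     # fitted values y-hat
head(resid(fit))      # residuals e = y - y-hat
sum(resid(fit)^2)     # RSS, the quantity least squares minimizes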

Geometrical interpretation

What to output

A useful way to plot the results for the multiple linear regression is to look at \(Y\) (observed values) against \(\hat{Y}\) (fitted values).

If things are working, these values should form a nice straight line. Can you guess the slope of the line?
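A sketch of this plot on simulated data (invented for illustration): if the model fits, the points scatter around the line with intercept \(0\) and slope \(1\).

set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
plot(fitted(fit), y, xlab = "Fitted values", ylab = "Observed y")
abline(0, 1)   # reference line with intercept 0 and slope 1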

Intervals for individual coefficients

  • \(H_0: \beta_j = 0\)
  • \(H_a: \beta_j \neq 0\)

Intervals and t-statistics are constructed exactly as in simple linear regression (now with \(n - p - 1\) degrees of freedom).

Important: intervals and testing via \(\hat{\beta}_j\) and \(\text{SE}(\hat{\beta}_j)\) are one-at-a-time procedures: we are evaluating the \(j^{th}\) coefficient conditional on the other \(X\)’s being in the model, but regardless of the values we have estimated for the other \(\hat{\beta}\)’s.

Testing for significance of the model

  • \(H_0: \beta_1 = \dots = \beta_p = 0\)
  • \(H_a: \exists \ j \in \{1, \dots, p\} \text{ s.t. } \beta_j \neq 0\)

We can use the F-statistic:

\[F = \dfrac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p,n-p-1}\] Note that we are not testing the significance of the intercept!
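A sketch on simulated data (invented for illustration) recomputing the overall F-statistic from the sums of squares and comparing it with the value stored in summary():

set.seed(1)
n <- 100; p <- 3
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)
RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)
F_stat <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
c(F_stat, summary(fit)$fstatistic["value"])   # same value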

Model Selection

Especially in high dimensions, it is very unlikely that all of the predictors are relevant to explain the outcome variable.

Usually, predictive accuracy (and interpretability) can be improved by selecting an “optimal” subset of predictors.

More to come on this…

How to interpret multiple linear regression

We have data about monthly sales of a certain item and its corresponding price (P1) from a vendor.

It looks like we should just raise our prices, right?

How to interpret multiple linear regression

The regression equation for Sales on own price is: \[\text{Sales} = 985.5 + 66.5 P1\]

If now we add the competitors price to the regression we get \[\text{Sales} = 98.8 -61.7 P1 + 140.9 P2\]

Does this look better? How did it happen?

Remember: -61.7 is the effect on sales of a change in P1 with P2 held fixed!
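A hedged sketch of the two fits, assuming a data frame sales_df with columns Sales, P1 (own price) and P2 (competitor's price); the data frame name and columns are assumptions, not the course data:

fit_own  <- lm(Sales ~ P1, data = sales_df)        # misleading positive slope on P1
fit_both <- lm(Sales ~ P1 + P2, data = sales_df)   # P1 effect with P2 held fixed
coef(fit_own)
coef(fit_both)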

How to interpret multiple linear regression

Let us look at a subset of points where P1 varies and P2 is held approximately constant.

For a fixed level of P2, variation in P1 is negatively correlated with Sales!

How to interpret multiple linear regression

  • In general, when we see a relationship between \(y\) and \(x\) (or \(\mathbf{x}\)’s), that relationship may be driven by variables “lurking” in the background which are related to your current \(\mathbf{x}\)’s.
  • This makes it hard to reliably find “causal” relationships. Any correlation (association) you find could be caused by other variables in the background… correlation is NOT causation
  • Any time a report says two variables are related and there’s a suggestion of a “causal” relationship, ask yourself whether or not other variables might be the real reason for the effect. Multiple regression allows us to control for important variables by including them in the regression.

Coming up…

  • Simple Linear regression
  • Multiple Linear regression
  • Qualitative predictors (dummy variables)
  • Extensions: interactions, nonlinear effects
  • Potential issues: outliers and assumption verification (residual plots and transformations)

Question time